In this report, we will focus on studying the results of the mis-splicing noise analysis of Ataxia RNAseq samples: 46 Cerebellum and 45 Frontal Cortex samples. Samples will be pseudobulked by tissue, status disease and ataxia diagnosis, splitting the analysis in three different levels, including the results of the splicing noise between specific Ataxia diagnosis (i.e. SCA1, FRDA…) and control samples.
Now, the clusters always compare the same number of samples between cases and controls. Procedure explained in methodology.
Added results for Level 2 (AtaxiaSubtype) and references.
Changed the reference transcriptome from Gencode v43 to Gencode v39.
Updated methodology.
For every available BAM file to study, we apply the following steps:
Download and extraction of BAM files: the files are downloaded from s3://ataxia-bulk-rnaseq/nextflow_first_attemp/Star_2_pass_by_indv/output/STAR/align/BAM_files/ and subfolders.
Junction extraction: all junctions are extracted using regtools junction extract after sorting and indexing with samtools. A file is created for each BAM file in BED12 format.
Junction annotation: the junctions are read from the previously created files and merged into a single dataframe of read junctions. We also register the number of reads of each junction in every sample. The junctions located within the ENCODE blacklisted regions v2 are removed. The junction_annot() function from the package dasper is used to annotate the junctions to the Gencode v38 reference transcriptome. All junctions not classified as either novel_donor, novel_acceptor or annotated are ignored. We also remove all junctions smaller than 25bp (base pairs) and annotated introns that are ambiguously assigned to more than one gene.
Junction pairing: by looking for overlaps between the novel junctions and the annotated junctions for each sample, we measure the distance in bp between the novel and reference splice site. The annotated introns that are never associated to a novel junction are considered a never misspliced junctions.
Filtering the distances: we remove the pairings in which a novel junctions are associated to more than one reference intron across different samples. For more information about this process, please see the methods section in Introverse paper [1].
Next, we need to decide on a clustering method to combine and compare different samples. More information in section about clustering.
Measuring the mis-splicing ratio: by adding all novel junction read counts attached to an annotated intron across all samples in which the novel splice was observed, and then dividing by the total number of reads of the annotated intron and the novel junctions across the same set of samples, we obtain a measurement of the mis-splicing ratio for an given annotated intron at both the donor splice site and the acceptor splice site. For more information about the mis-splicing ratio, please see section MSR.
Generation of the DB: two tables are created per each cluster: db_introns and db_novel. Each one contains the relevant information related to reference introns (including the never misspliced introns) and novel junctions. This includes the MaxEntScan scores, the percentage of protein-coding transcripts and the classification in u2 and u12 introns.
In our dataset, we have a total of 95 samples, corresponding to 48 different individuals. A total of 4 samples are removed because they belong to individuals diagnosed as CANVAS and AIFM1.
Three different level of studies were studied in this report, and always different analyses for each tissue.
Level 1 (Type): whether the sample is diagnosed with ataxia or not (diases status). For the frontal cortex tissue, samples with \(RIN<4\) will be removed, while for the Cerebellum tissue only controls with \(RIN<=7\) are kept. This is to ensure non-significant differences in the RIN medians.
Level 2 (AtaxiaSubtype): two different analyses are performed: known ataxia cases vs. controls and unknown ataxia cases vs. controls. In both scenarios, no restrictions about RIN is required.
Level 3 (Diagnosis): a different analysis was performed for every ataxia diagnosis with at least three samples: FRDA, SCA1, SCA2 and SCA6. In all situations, control samples are selected to minimize a weighted Gower distance to the case samples (more information in following section).
Studies about the relationship in mis-splicing ratio’s median and the number of samples pseudobulked (reference) showed a clear correlation between the two. In order to avoid this effect in the comparisons between cases and controls, we decided to subsample the majority class until both classes have the same number of samples. The subsampling was performed so that the weighted Gower distance between the case samples and the control samples was minimized.
For datasets with both quantitative (i.e. RIN) and categorical (i.e. Brain bank) variables, also called mixed datasets, Gower’s distance is a common measurement of similarity between any two samples [2]. The similarity between samples can be defined as:
\[ S_{ij}=\frac{\sum_{k=1}^p s_{ijk}\delta_{ijk}w_k}{\sum_{k=1}^p \delta_{ijk}w_k} \] where \(p\) are the variables beeing compared, \(i\) and \(j\) refers to two different samples and:
We decided to apply a weighting to the variables in order to increase the relevance of the main contributors to the number of junction reads across samples.
The subsampling process was the following:
Variable weights: first, we divided the samples by tissue. Then, for each sample, we extract the number of reads associated to both annotated and novel junctions. Using a linear model to predict this number of reads, we measured the variance explained by each of the main covariates that will be employed in subsampling: RIN, PMI, Age at death, Brain bank and sex. The percentage of variance explained by each covariate will be considered as the weight in the next step.
Gower’s distance between samples: for each minority class sample, the most similar majority class sample was selected without repetition (i.e. once two samples were assigned together, none of them can be selected again). Thus, the same number of samples between the two classes were obtained.
Wilcoxon test: between the two sets of samples, a Wilcoxon test was executed to test whether there are significant differences in the RIN medians. If significant differences are found, some additional restrictions are applied to any of the sets. For example, for Cerebellum Level 1 study, control samples with RIN higher than 7 needed to be removed to ensure non-significant differences in the median.
All studies will be presented with the samples employed for both classes and Wilcoxon test results. The obtained weights are the following:
| Covariate | Cerebellum weights | Frontal Cortex weights |
|---|---|---|
| RIN | 0.989 | 0.855 |
| PMI | 0.001 | 0.132 |
| Brain Bank | 0.003 | 0.013 |
| Age at death | 0.001 | 0.000 |
| Sex | 0.006 | 0.000 |
For each study, we will only consider the common annotated introns between the two classes. To do so, we generate the following dataframes:
Common annotated intron table: we looped through both db_introns tables and extracted only the information from the common annotated introns in the clusters. To identify common annotated introns, we used their locus (i.e. seqname:start-end:strand), since it is a unique identifier. The goal is to have the same number of annotated between cases and controls.
Common novel junction table: we looped through both db_novel tables and extracted only the information from the novel junctions associated to common annotated introns. Thus, we first needed to calculate the common annotated intron table.
The examples in this section corresponds to the Level 1 study for Cerebellum.
For each annotated intron, two Mis-Splicing Ratios (\(MSR\)) are calculated to provide a measurement of the mis-splicing frequency at the donor site (\(MSR_D\)) and acceptor site (\(MSR_A\)). We first sum all of the novel donor/acceptor junction read counts and then divide by the sum of all annotated intron and novel junction read counts across the specific samples. It follows this formula:
\[ \begin{equation} MSR_A = \frac{\sum_{i=1}^{N}j_i}{\sum_{i=1}^{N}j_i+\sum_{i=1}^Ns_i } \end{equation} \]
where \(j\) is the number of novel acceptor junction reads for a particular annotated intron, \(s\) is the number of annotated intron reads and \(N\) is the number of samples being studied. We can generate an \(MSR\) table in which each row corresponds to an annotated intron and each column to a cluster. Example of the generated \(MSR_A\) tables:
Once we have calculated the \(MSR_A\) and \(MSR_D\) for every annotated intron and cluster, we use the paired Wilcoxon signed rank test to study if there is a significant variation in the median \(MSR\) in cases vs. controls. To be more precise:
We obtain two p-values (one for each splice site) to reject or not the null hypothesis in favour of the alternative hypothesis. If the p-value is small enough to reject the null hypothesis, we then evaluate the effect size of the difference in medians. Even though effect sizes are complex to interpret, commonly accepted values are: 0.10-0.3 (small effect), 0.3-0.5 (moderate effect) and >=0.5 (large effect).
The studies for cases vs controls for each tissue report the following effect sizes after the Wilcoxon paired signed-rank test:
In general, even if the difference in median is significant across clusters for both tissues and splice sites, the effect sizes are too small to be considered relevant. Results are also shown in the following table:
| Tissue | Splice site | p-value | Effect size | Magnitude |
|---|---|---|---|---|
| Cerebellum | Donor | 0.000 | 0.048 | small |
| Cerebellum | Acceptor | 0.000 | 0.051 | small |
| Frontal | Donor | 0.819 | 0.000 | small |
| Frontal | Acceptor | 0.000 | 0.022 | small |
Distribution of sample RIN for Cerebellum level 1 study:
Control samples with \(RIN > 7\) were removed.
Distribution of sample RIN for Frontal Cortex level 1 study:
All samples with \(RIN <= 4\) were removed.
The studies for ataxia subtype vs. controls for each tissue report the following effect sizes after the Wilcoxon paired signed-rank test:
Results are also shown in the following table:
| Subtype | Tissue | Splice site | p-value | Effect size | Magnitude |
|---|---|---|---|---|---|
| KnownAtaxia | |||||
| KnownAtaxia | Cerebellum | Donor | 0.001 | -0.008 | small |
| KnownAtaxia | Cerebellum | Acceptor | 0.000 | -0.026 | small |
| KnownAtaxia | Frontal | Donor | 0.000 | 0.008 | small |
| KnownAtaxia | Frontal | Acceptor | 0.000 | 0.021 | small |
| UnknownAtaxia | |||||
| UnknownAtaxia | Cerebellum | Donor | 0.000 | 0.096 | small |
| UnknownAtaxia | Cerebellum | Acceptor | 0.000 | 0.134 | small |
| UnknownAtaxia | Frontal | Donor | 0.000 | -0.036 | small |
| UnknownAtaxia | Frontal | Acceptor | 0.000 | -0.015 | small |
Distribution of sample RIN for Cerebellum level 2 study:
Distribution of sample RIN for Frontal Cortex level 2 study:
The studies for diagnosis vs. controls for each tissue report the following effect sizes after the Wilcoxon paired signed-rank test:
Results are also shown in the following table:
| Diagnosis | Tissue | Splice site | p-value | Effect size | Magnitude |
|---|---|---|---|---|---|
| FRDA | |||||
| FRDA | Cerebellum | Donor | 0.000 | 0.035 | small |
| FRDA | Cerebellum | Acceptor | 0.000 | 0.039 | small |
| FRDA | Frontal | Donor | 0.000 | -0.025 | small |
| FRDA | Frontal | Acceptor | 0.000 | -0.028 | small |
| SCA1 | |||||
| SCA1 | Cerebellum | Donor | 0.020 | 0.006 | small |
| SCA1 | Cerebellum | Acceptor | 0.000 | -0.015 | small |
| SCA1 | Frontal | Donor | 0.000 | -0.011 | small |
| SCA1 | Frontal | Acceptor | 0.000 | -0.016 | small |
| SCA2 | |||||
| SCA2 | Cerebellum | Donor | 0.011 | 0.006 | small |
| SCA2 | Cerebellum | Acceptor | 0.000 | 0.024 | small |
| SCA2 | Frontal | Donor | 0.000 | -0.051 | small |
| SCA2 | Frontal | Acceptor | 0.000 | -0.039 | small |
| SCA6 | |||||
| SCA6 | Cerebellum | Donor | 0.000 | -0.039 | small |
| SCA6 | Cerebellum | Acceptor | 0.000 | -0.033 | small |
| SCA6 | Frontal | Donor | 0.000 | 0.034 | small |
| SCA6 | Frontal | Acceptor | 0.000 | 0.052 | small |
Distribution of sample RIN for Cerebellum level 3 study:
Distribution of sample RIN for Frontal Cortex level 3 study:
Previous results (not shown in this report) showed us that there seems to be a relationship between the number of samples being clustered and the median MSR values. This result is to be expected if we analyse the situation. We have already proven in the past that the number of samples increase the number of novel junctions, which increases the likelihood of a mis-splicing event to be found for any particular intron. As such, it is expected that more reference introns have a different than zero MSR value, which would lead to an increase in the median MSR.
The reduction in the proportion of MSR = 0 reference introns was observed for both tissues and splice sites in control samples:
To expand on this idea, we tested a Wilcoxon paired signed-rank test between different control samples. We will call cluster A as the one constructed from three control samples: the highest RIN, the lowest RIN and the median RIN. Then, different clusters will be generated based on different number of samples per each sample in cluster A. Let’s call cluster B1 to the one generated with three samples in total, selected to to minimize the weighed Gower’s dissimilarity index to each of the case samples of cluster A. Cluster B2 then would be created with two samples per sample in cluster A (i.e. 6 samples in total). These will be called clusters B.
Once we have clusters A and B, we execute a Wilcoxon test to compare the median MSR between both clusters and splice sites. Positive results would mean a higher MSR median in cluster A, while negative results represents the opposite. Results are shown in the following figure:
As expected, the effect size decreases as we increase the samples in cluster B. Thus, in the studies in which the number of control samples is bigger than the number off case samples, it is expected that the Wilcoxon tests report a decrease in the median MSR values between case and control samples.
## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.1 (2022-06-23)
## os Ubuntu 20.04.4 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2023-06-05
## pandoc 2.18 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## abind 1.4-5 2016-07-21 [1] RSPM (R 4.2.0)
## backports 1.4.1 2021-12-13 [1] RSPM (R 4.2.0)
## bit 4.0.5 2022-11-15 [1] RSPM (R 4.2.0)
## bit64 4.0.5 2020-08-30 [1] RSPM (R 4.2.0)
## bookdown 0.33 2023-03-06 [1] RSPM (R 4.2.0)
## broom 1.0.4 2023-03-11 [1] RSPM (R 4.2.0)
## bslib 0.4.2 2022-12-16 [1] RSPM (R 4.2.0)
## cachem 1.0.8 2023-05-01 [1] RSPM (R 4.2.0)
## car 3.1-2 2023-03-30 [1] RSPM (R 4.2.0)
## carData 3.0-5 2022-01-06 [1] RSPM (R 4.2.0)
## cli 3.6.1 2023-03-23 [1] RSPM (R 4.2.0)
## codetools 0.2-19 2023-02-01 [1] RSPM (R 4.2.0)
## colorspace 2.1-0 2023-01-23 [1] RSPM (R 4.2.0)
## crayon 1.5.2 2022-09-29 [1] RSPM (R 4.2.0)
## DBI 1.1.3 2022-06-18 [1] RSPM (R 4.2.0)
## digest 0.6.31 2022-12-11 [1] RSPM (R 4.2.0)
## doParallel * 1.0.17 2022-02-07 [1] RSPM (R 4.2.0)
## dplyr * 1.1.2 2023-04-20 [1] RSPM (R 4.2.0)
## evaluate 0.21 2023-05-05 [1] RSPM (R 4.2.0)
## fansi 1.0.4 2023-01-22 [1] RSPM (R 4.2.0)
## farver 2.1.1 2022-07-06 [1] RSPM (R 4.2.0)
## fastmap 1.1.1 2023-02-24 [1] RSPM (R 4.2.0)
## forcats * 1.0.0 2023-01-29 [1] RSPM (R 4.2.0)
## foreach * 1.5.2 2022-02-02 [1] RSPM (R 4.2.0)
## generics 0.1.3 2022-07-05 [1] RSPM (R 4.2.0)
## ggforce 0.4.1 2022-10-04 [1] RSPM (R 4.2.0)
## ggplot2 * 3.4.2 2023-04-03 [1] RSPM (R 4.2.0)
## ggpubr 0.6.0 2023-02-10 [1] RSPM (R 4.2.0)
## ggsignif 0.6.4 2022-10-13 [1] RSPM (R 4.2.0)
## glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
## gridExtra 2.3 2017-09-09 [1] RSPM (R 4.2.0)
## gtable 0.3.3 2023-03-21 [1] RSPM (R 4.2.0)
## here * 1.0.1 2020-12-13 [1] RSPM (R 4.2.0)
## highr 0.10 2022-12-22 [1] RSPM (R 4.2.0)
## hms 1.1.3 2023-03-21 [1] RSPM (R 4.2.0)
## htmltools 0.5.5 2023-03-23 [1] RSPM (R 4.2.0)
## httr 1.4.5 2023-02-24 [1] RSPM (R 4.2.0)
## iterators * 1.0.14 2022-02-05 [1] RSPM (R 4.2.0)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.2.0)
## jsonlite 1.8.4 2022-12-06 [1] RSPM (R 4.2.0)
## kableExtra 1.3.4.9000 2023-01-30 [1] Github (haozhu233/kableExtra@292f607)
## knitr 1.43 2023-05-25 [1] RSPM (R 4.2.0)
## labeling 0.4.2 2020-10-20 [1] RSPM (R 4.2.0)
## lattice 0.21-8 2023-04-05 [1] RSPM (R 4.2.0)
## lifecycle 1.0.3 2022-10-07 [1] RSPM (R 4.2.0)
## lpSolve 5.6.18 2023-02-01 [1] RSPM (R 4.2.0)
## lubridate * 1.9.2 2023-02-10 [1] RSPM (R 4.2.0)
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
## MASS 7.3-59 2023-04-21 [1] RSPM (R 4.2.0)
## Matrix 1.5-4 2023-04-04 [1] RSPM (R 4.2.0)
## mitools 2.4 2019-04-26 [1] RSPM (R 4.2.0)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.2.0)
## pillar 1.9.0 2023-03-22 [1] RSPM (R 4.2.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
## polyclip 1.10-4 2022-10-20 [1] RSPM (R 4.2.0)
## proxy 0.4-27 2022-06-09 [1] RSPM (R 4.2.0)
## purrr * 1.0.1 2023-01-10 [1] RSPM (R 4.2.0)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
## Rcpp 1.0.10 2023-01-22 [1] RSPM (R 4.2.0)
## readr * 2.1.4 2023-02-10 [1] RSPM (R 4.2.0)
## rlang 1.1.1 2023-04-28 [1] RSPM (R 4.2.0)
## rmarkdown 2.21 2023-03-26 [1] RSPM (R 4.2.0)
## rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.2.0)
## rstatix 0.7.2 2023-02-01 [1] RSPM (R 4.2.0)
## rstudioapi 0.14 2022-08-22 [1] RSPM (R 4.2.0)
## rvest 1.0.3 2022-08-19 [1] RSPM (R 4.2.0)
## sass 0.4.6 2023-05-03 [1] RSPM (R 4.2.0)
## scales 1.2.1 2022-08-20 [1] RSPM (R 4.2.0)
## sciRmdTheme 0.3 2023-03-10 [1] local
## sessioninfo * 1.2.2 2021-12-06 [1] RSPM (R 4.2.0)
## StatMatch 1.4.1 2022-03-01 [1] RSPM (R 4.2.0)
## stringi 1.7.12 2023-01-11 [1] RSPM (R 4.2.0)
## stringr * 1.5.0 2022-12-02 [1] RSPM (R 4.2.0)
## survey 4.2-1 2023-05-03 [1] RSPM (R 4.2.0)
## survival 3.5-5 2023-03-12 [1] RSPM (R 4.2.0)
## svglite 2.1.1 2023-01-10 [1] RSPM (R 4.2.0)
## systemfonts 1.0.4 2022-02-11 [1] RSPM (R 4.2.0)
## tibble * 3.2.1 2023-03-20 [1] RSPM (R 4.2.0)
## tidyr * 1.3.0 2023-01-24 [1] RSPM (R 4.2.0)
## tidyselect 1.2.0 2022-10-10 [1] RSPM (R 4.2.0)
## tidyverse * 2.0.0 2023-02-22 [1] RSPM (R 4.2.0)
## timechange 0.2.0 2023-01-11 [1] RSPM (R 4.2.0)
## tweenr 2.0.2 2022-09-06 [1] RSPM (R 4.2.0)
## tzdb 0.3.0 2022-03-28 [1] RSPM (R 4.2.0)
## utf8 1.2.3 2023-01-31 [1] RSPM (R 4.2.0)
## vctrs 0.6.2 2023-04-19 [1] RSPM (R 4.2.0)
## viridis 0.6.3 2023-05-03 [1] RSPM (R 4.2.0)
## viridisLite 0.4.2 2023-05-02 [1] RSPM (R 4.2.0)
## vroom 1.6.3 2023-04-28 [1] RSPM (R 4.2.0)
## webshot 0.5.4 2022-09-26 [1] RSPM (R 4.2.0)
## withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
## xfun 0.39 2023-04-20 [1] RSPM (R 4.2.0)
## xml2 1.3.4 2023-04-27 [1] RSPM (R 4.2.0)
## yaml 2.3.7 2023-01-23 [1] RSPM (R 4.2.0)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library
##
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────